Spontaneous Self-Assembly of Transcription Factor Based Gene Regulation Networks

We model the transcription factor based regulation network of yeast using a content-based network
model that mimics the recognition of binding motifs on the regulatory regions of the genes. We are
thereby able to faithfully reproduce many of the topological features of the gene regulatory network
of yeast once the parameters of the yeast genome, in particular the distribution of information coded
by the "binding sequences" within the promoter regions, are provided as input. The length distribution
for the promoter regions is fixed by comparing the k-core analysis of the model network with that
of yeast. Our results strongly point to the possibility that the observed topological features are
generic to networks formed via sequence-matching between random strings obeying certain length
distributions.

I. INTRODUCTION

Development of new experimental techniques, such as
DNA microarrays, in the late 1990s [1, 2] made a huge
impact on cell biology research. Such experiments generated
a flood of expression data for several well-studied
single-cell species for which we now have an almost com- 
plete list of not only the genes, but also the interac- 
tions between them. A cell is able to survive, grow and 
replicate due to the collective actions of its genes. The 
adaptation and robustness of its activities in a constantly 
changing environment is maintained by the complex net- 
work of interactions between the genes. 

The regulation of gene expression in a cell relies to a major extent on dedicated proteins
called transcription factors (TFs) [3]. These proteins come with a structure suited to recognize
and bind the DNA at specific locations called binding sites. The binding affinity of a TF on a
certain DNA segment is determined by the base sequence at the location. Each TF preferentially
binds certain regulatory sequences, or binding motifs, within the promoter regions (PRs)
responsible for the regulation of the gene. In the case of yeast, Saccharomyces cerevisiae, a list
of the binding motifs for more than 100 TFs has recently been provided [4, 5]. It was also
reported [5] that the TF binding sites are located with high probability within a window of
several hundred bases upstream of the transcription activation site (preceding the start codon of
the gene), although longer-distance action is also possible. In fact, the existence of a
high-affinity binding motif in a promoter region is a necessary but not sufficient condition for
TF-based expression regulation [5]. Moreover, especially in eukaryotic cells, gene regulation
relies on the simultaneous action of multiple TFs.

We argue that the global features of the gene regulation network depend very little on such
details and are largely determined by the distribution of the amount of shared information, or
content, that is required for the establishment of regulatory interactions. It may be conjectured
that information sharing and its distribution is the basic organizing principle responsible for
the universality of the degree distribution of gene regulatory networks across diverse species.

In this paper we propose to model the transcription 
regulation network of yeast using the ideas of the content-based
model we introduced earlier [7, 8]. We are able to
faithfully reproduce all the topological aspects of the gene 
regulatory network of yeast when the parameters of the 
yeast genome, in particular the distribution of informa- 
tion coded by the "binding sequences" of the regulatory 
segments, are given as input. We compare the ensem- 
ble of the resulting model networks with the data on the 
yeast regulatory network available in different databases. 

Gene regulatory networks can be naturally described as a directed graph where the nodes are the
genes. A directed edge from node A to node B implies that the transcription factor produced by
gene A regulates the activity of gene B. Since the edges are directed, one distinguishes the
in-degree (the number of incoming edges), the out-degree (the number of outgoing edges) and the
total degree of a node, each with their own (possibly distinct) probability distributions. These
distributions serve as distinguishing features of the network which a realistic model is expected
to reproduce. Further structural aspects of these networks are probed by measures such as the
clustering coefficient C(k), the degree-degree correlation between connected vertices [11], the
"rich-club coefficient" [12, 13], or the k-core decomposition [14], recently employed to predict
new interactions in various biological systems [15, 16, 17, 18, 19].

This report is organized as follows: In Section II we introduce our model, which we compare with
the experimentally determined yeast regulatory network in Section III. A discussion is provided
in Section IV, while Section V outlines our methods.

II. THE MODEL

The nodes of our model network correspond to genes. 
We differentiate between genes which code for a Tran- 
scription Factor (TF) and those which do not. All genes 
are assumed to be possible targets of regulation by one 
or more TFs. Each node has a sequence associated with 
it, representing the promoter region (PR) through which 
the corresponding gene may be regulated. We pick a 
given percentage of nodes (around 5%, see Table I) at 
random, to represent TF-producing genes. With each 
TF-producing node/gene we also associate a second se- 
quence, which stands for the binding motif, which the TF 
recognizes and binds in the promoter region of another 
gene. 

We represent both the binding motifs and the PRs as 
random binary sequences of variable length. The mechanism
for establishing connections between nodes of the
gene regulatory network is given by a string matching
condition [7, 8] between the binding motifs of the TFs
and all possible uninterrupted subsequences of the PRs.
The (directed) network of regulatory gene interactions is 
then obtained by connecting each TF-producing node A 
to all those nodes B, B', B" . . . whose PRs contain the 
binding motif associated with node A. The amount of 
information coded in these randomly generated binding 
motifs and promoter regions constitutes the essential in- 
gredient of our model and dictates the overall topology 
of the resultant networks. 
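
The connection rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `build_network`, the dictionary layout, and the toy bit strings are all assumptions made for the example. Self-regulation (a TF whose motif occurs in its own PR) is kept as an edge here, which is also an assumption.

```python
def build_network(promoters, motifs):
    """Connect each TF-coding node to every node whose PR contains its motif.

    promoters: dict mapping node -> binary string (every gene has a PR)
    motifs:    dict mapping node -> binary string (TF-coding genes only)
    Returns a set of directed edges (regulator, target).
    """
    edges = set()
    for regulator, motif in motifs.items():
        for target, pr in promoters.items():
            # Python's `in` tests for an uninterrupted substring match.
            if motif in pr:
                edges.add((regulator, target))
    return edges


# Toy data (hypothetical, far shorter than the PRs used in the model).
promoters = {"g1": "0010110", "g2": "1110001", "g3": "0101010"}
motifs = {"g1": "101", "g3": "000"}  # g1 and g3 play the role of TF-coding genes
edges = build_network(promoters, motifs)
print(sorted(edges))
```

In this toy case the motif "101" occurs in the PRs of g1 and g3, and "000" occurs only in the PR of g2, so three directed edges result.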

Experimentally determined TF binding motifs are typ- 
ically short sequences with a narrow length distribution, 
since a TF selectively binds 5-10 bases and not much 
more. A single TF can bind a range of similar motifs, 
and the relative frequencies of the four bases at each po- 
sition within the motif contribute to the information ex- 
changed in the binding process. The promoter regions 
(PRs) which lie in the intergenic portions of the genome 
are typically longer and may accommodate several binding
motifs (as shown in Fig. 1) to allow graded and/or
combinatorial regulation [4, 5].

The bitwise length distribution of the model binding motifs was derived from the yeast data
provided by Harbison et al. in [5]. The motifs were reported [5] as letter sequences comprising
the symbols for the four bases {ATGC}, or the symbols {YMKRSW} for incompletely specified bases,
with the corresponding lower case letters indicating a lower confidence level. In order to account
for such variations in the information content of the motifs, we assigned two bits to each of the
letters {ACTG} appearing in the motif, signifying a high information content at that position, and
one bit otherwise. The length of the bit sequence obtained in this way roughly corresponds to the
amount of shared information, measured by the Shannon entropy [20], required for the binding of
the TF. Performing this calculation for each TF in [5], we obtain the length distribution shown
in Fig. 2.
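
The bit-counting rule above can be written as a one-line function. The sketch below assumes that only the upper-case letters A, C, G, T count as fully specified (two bits), and that every other symbol, including lower-case letters, counts as one bit; the text does not spell out the treatment of lower-case base letters, so that choice is an assumption. The motif strings are made-up examples, not entries from the data of [5].

```python
def motif_bits(motif):
    """Bit length of a motif: 2 bits per upper-case A/C/G/T, 1 bit otherwise.

    Assumption: lower-case letters (lower confidence) and degenerate
    symbols such as Y, M, K, R, S, W all contribute a single bit.
    """
    return sum(2 if symbol in "ACGT" else 1 for symbol in motif)


# Hypothetical motif strings, for illustration only.
print(motif_bits("ACGT"))   # four fully specified bases
print(motif_bits("ACRYt"))  # two specified bases plus three weaker positions
```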

FIG. 1: The mechanism of interaction between the genes as envisaged in our model. The genes are
indicated by ellipses (green if TF-coding, blue otherwise), the transcription factors by triangles
with the associated binding motif in the box underneath. Non-TF proteins are symbolized by the "P"
shape, and the promoter regions (PR) upstream of each gene are shown as red boxes. Binding occurs
if the binding motif matches a subsequence in the PR, as is the case here at PR4. PRs in the model
are typically much longer than depicted here.

In choosing the length distribution of the promoter regions, about which less is known, we are
guided by the finding [5] that most of the probability for encountering a TF binding site is
contained within a window of 250 base pairs (bps) located approximately 100 bps upstream of a
gene. The PR length distribution that we adopt within this range decays with a power law
$p(l) \propto l^{-1-\mu}$ with $0 < \mu < 2$, after the findings of Almirantis and Provata [21]
for the lengths of intergenic regions. We also assign a minimum length chosen to coincide with
the peak of the motif-length distribution shown in Fig. 2. Note that the 250 bps window does not
double as we move from the 4-letter alphabet to a binary one, because the matching probabilities
and the total number of positions at which the TFs may bind are required to remain invariant
under this transformation.
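
As an illustration of how PR lengths could be drawn from a truncated power law, the sketch below samples integer lengths with weights proportional to $l^{-1-\mu}$. The functional form of the exponent is an assumption of this sketch, as are the function name `sample_pr_lengths` and the bounds `l_min=14` and `l_max=250`, which are placeholders; the text fixes the minimum at the peak of the motif-length distribution without quoting a number.

```python
import random


def sample_pr_lengths(n, l_min, l_max, mu, rng):
    """Draw n integer lengths in [l_min, l_max] with weight l**(-1 - mu).

    Assumption: the power law is simply truncated at both ends and
    normalized over the allowed integer lengths.
    """
    lengths = list(range(l_min, l_max + 1))
    weights = [length ** (-1.0 - mu) for length in lengths]
    return rng.choices(lengths, weights=weights, k=n)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = sample_pr_lengths(1000, l_min=14, l_max=250, mu=0.1, rng=rng)
print(min(sample), max(sample), sum(sample) / len(sample))
```

Because the weights decrease with length, short PRs dominate the sample while a minority of long PRs can host many motifs.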

The value of $\mu$ remains as the only adjustable parameter in our model, and is determined by
comparing the k-core decomposition of the gene regulatory network of yeast as extracted from
experimental data (Table I) with our content-based network model, as explained in the
Methods section.

Each realization of the model results in a different collection of random PRs and binding motifs,
and hence a somewhat different network. The collection of such model networks forms an ensemble
whose features are a direct consequence of the string-matching mechanism and the length
distributions. These features turn out to be strikingly distinct from those encountered in random
or scale-free networks. We show below that the "signatures" of this ensemble are shared by the
yeast regulatory network.

FIG. 2: Distribution of the amount of bitwise information coded by each regulatory sequence
recognized and bound by the 102 TFs in the yeast genome (compiled from the recently published
data by Harbison et al. [5]). This distribution is adopted as the length distribution of the
random regulatory sequences ("binding motifs") in our model.

III. RESULTS 

Our purpose here is to show that the experimentally 
determined features of the yeast regulation network fol- 
low closely those typical of the ensemble defined by our 
model. The topological features we will focus on are the 
following: 

1. degree distribution (in-, out-, and total): the 
distribution of the number of connections of the 
nodes in a network. 

2. clustering coefficient: the modularity of the net- 
work. 

3. degree-degree correlations: average degree of
the neighbors of a node with degree k.

4. "rich-club" coefficient: a measure of the relative 
connectivity among nodes whose degree is higher 
than a given number. 

5. k-core structure: the hierarchical structuring in
the network.

The precise definition of these quantities is given in the 
Methods section. 

Here we will report the comparison of our results with the most recent Yeastract [24] data.
Analogous comparisons with each of the data sources listed in Table I yield similar results
(see Supplementary Material), showing that our conclusions are consistent with all the
different data sets available.

TABLE I: The number of interacting genes, TFs, and interacting pairs that appear in the yeast
regulatory network as obtained from different sources.

Source                 Genes   TFs   Interacting Pairs
Fraenkel Lab (a)       2884    102   6441
Yeastract (b)          4252    146   12530
Luscombe et al. (c)    3459    142   7071
Kirdar et al. (d)      3763    180   9135

(a) http://fraenkel.mit.edu/Harbison/release_v24/bound_by_factor/
(b) http://www.yeastract.com
(c) http://sandy.topnet.gersteinlab.org/index2.html
(d) private communication

In order to compare our results with the available data we generate an ensemble of realizations,
with an average of $N_g = 6000$ genes in total, 4167 of which contribute to the network on the
average. Out of these, 202 (making up 4.8% of the genes) are TF-coding genes, taking part in a
total of 14365 interactions, again on the average. The corresponding values for the yeast
regulatory networks reported in the publicly available databases are given in Table I.

The total degree distribution is obtained by ignoring the directionality of the interactions and
is different from the superposition of in- and out-degree distributions. In Fig. 3a, Yeastract
data for the degree distribution is shown on top of a scatter plot obtained by superposing the
results from 100 artificial model genomes independently generated according to the rules
described in Section II. In Fig. 3b we exhibit the in-degree distribution obtained from the
Yeastract data, and the corresponding scatter plot.

The out-degree distribution of the yeast and model 
networks exhibits a rather large scatter of points due to 
the relatively small number of TFs. Comparing with the 
scatter plot obtained from 100 realizations, we find again 
that the actual yeast data falls within the boundaries set 
by the model ensemble (Fig. 3c).

In Fig. 4 we report the three topological coefficients, the clustering coefficient, the
degree-degree correlation and the "rich-club" coefficient, that go beyond degree distributions
in characterizing the network. The agreement is extremely good; in particular, the shoulder
observed in the "rich-club" coefficient in Fig. 4c, a feature common to both gene-regulation and
protein-protein interaction networks [11], is captured accurately in our model.

The agreement observed with the Yeastract data is not 
source-specific, as can be seen from a comparison of the 
topological properties of our model networks, with those 
obtained from the different sources listed in Table I (see
Supplement).

Finally, in Fig. 5 (left), the k-core analysis of the model network is shown, which should be
compared with that of the Yeastract data on the right. The k-core analysis provides a much more
stringent characterization of a network than the other single topological features considered
above. To give an idea of the sensitivity of the k-core analysis to the structure of the network,
let us point out that, under a shuffling of the edges of the network keeping the degree of each
node fixed, the typical value of the maximum number of k-cores, $k_{max}$, becomes 29 rather than
9 as observed in both the real yeast regulatory network and the model (see Supplement).

FIG. 3: Degree distributions extracted from the Yeastract [24] data (red circles), superposed on
the corresponding degree distributions of 100 realizations of the model network (black dots).
From left to right: a) The total degree distribution with an inset showing a log-linear plot for
$k/k_{av} < 10$, where one may observe that both the model and the data points almost fall on a
straight line. b) The in-degree distribution plotted on a semi-logarithmic scale. c) The
out-degree distribution plotted on a log-log scale. The axes are scaled by the average total
degree in order to factor out sample-to-sample fluctuations in the network size.

FIG. 4: Comparison of a) the clustering coefficient C(k), b) the degree-degree correlations
between neighboring nodes $k_{nn}(k)$, and c) the rich-club coefficient r(k), from left to right,
for 100 realizations of the model (black dots) and the Yeastract data (red circles).



IV. DISCUSSION 

The close structural similarity between the model and 
the real yeast regulatory network, with respect to a di- 
verse set of criteria, shows that they are part of the 
same statistical ensemble of networks, formed by random 
strings connected by the sequence matching rule. 

The sequence matching rule could more generally be viewed as an information-theoretical
constraint, where the interaction between two genes requires the fulfillment of a set of
conditions which we symbolically represent as the matching of two random sequences. The more
stringent the prerequisites of the interaction, the longer is the random "binding motif" that is
to be matched. The length of the PR establishes the size of the phase space in which the motif is
to be sought. The properties of the network are then determined by the distributions obeyed by
the lengths of the binding motifs as well as the promoter regions.

FIG. 5: Left: The k-core decomposition of a single realization of our model network obtained with
the visualization tool lanet-vi [25]. The length distribution exponent of the PR sequences has
been adjusted to $\mu = 0.1$ to optimize the similarity with the k-core distribution of the
Yeastract data (Right). Dots represent the nodes of the network, while edges between nodes depict
connections. Nodes belonging to different k-shells are indicated by different colors (on the
right hand side) and are arranged around concentric circles, whose average radius decreases with
k. In particular, a node of a given shell is placed just inside (outside) the corresponding
circle, if it is preferentially connected to lower (higher) k-shells. The size of the dots
indicates the degree of the respective nodes; see legends to the left of the figures.

Interpreted within this information-theoretical framework, our model has sufficient generality to
accommodate other interactions based on lock-and-key mechanisms, such as protein networks, where
the interactions are dictated by certain steric and chemical conditions.

The topological features of the networks investigated here and shown to be shared by the yeast
regulatory network strongly point to the possibility that these networks did not have to be
assembled from scratch, but rather emerged spontaneously, given any sufficiently long linear
code. This proposition by no means minimizes the role of evolutionary pressures on such networks;
instead, it suggests that a network with essentially the current topology could have provided a
starting point for further fine-tuning. As a case in point, it has recently been demonstrated
that evolution under duplication and divergence [26] may leave the topological features of such
networks essentially invariant [27]. Such a perspective will hopefully bring us a step closer to
envisioning how complex structures may have come into existence, by shifting some of the load
from the shoulders of evolution onto the laws of probability.

V. METHODS

The degree k of a node is the number of edges connected to it. When the graph is directed, one
distinguishes in-, out-, and total-degrees of a node, with their corresponding distributions. In
the measures below we have ignored the directionality of the network.

The clustering coefficient is given by the formula:

\[ C_i = \frac{\Delta_i}{k_i (k_i - 1)/2} , \]

where $\Delta_i$ is the number of triangles that contain node i. The quantity C(k) plotted in
Fig. 4 is the average of $C_i$ over the nodes with degree k.

The degree-degree correlation function $k_{nn}(k)$ is

\[ k_{nn}(k) = \sum_{k'} k' \, p(k'|k) , \]

where $p(k'|k)$ is the conditional probability that a node with degree k is connected to a node
with degree k'.
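
The two measures just defined can be computed directly from an undirected adjacency structure. The sketch below is illustrative only; the helper names (`adjacency`, `clustering`, `knn_per_node`, `average_by_degree`) and the four-node example graph are assumptions of this sketch. It uses the fact that the sum over k' of k' p(k'|k) equals the mean neighbor degree averaged over all nodes of degree k. Self-loops are dropped, and node labels are assumed to be mutually comparable.

```python
from collections import defaultdict


def adjacency(edges):
    """Undirected adjacency sets; direction and self-loops are ignored."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def clustering(adj):
    """C_i = (triangles through i) / (k_i (k_i - 1) / 2); 0 if k_i < 2."""
    result = {}
    for i, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            result[i] = 0.0
            continue
        # Each linked pair of neighbors closes one triangle through i.
        triangles = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        result[i] = triangles / (k * (k - 1) / 2)
    return result


def knn_per_node(adj):
    """Mean degree of the neighbors of each node."""
    return {i: sum(len(adj[j]) for j in nbrs) / len(nbrs) for i, nbrs in adj.items()}


def average_by_degree(adj, values):
    """Average a per-node quantity over all nodes sharing the same degree k."""
    sums, counts = defaultdict(float), defaultdict(int)
    for i, nbrs in adj.items():
        sums[len(nbrs)] += values[i]
        counts[len(nbrs)] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Small example: a triangle (1, 2, 3) with a pendant node 4 attached to 3.
adj = adjacency([(1, 2), (2, 3), (1, 3), (3, 4)])
C = average_by_degree(adj, clustering(adj))      # C(k)
KNN = average_by_degree(adj, knn_per_node(adj))  # k_nn(k)
print(C, KNN)
```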

The "rich-club" coefficient [12, 13] r(k) is the total number $e_{>k}$ of edges connecting nodes
with degree greater than k, normalized by the maximum possible number of such connections,

\[ r(k) = \frac{2 e_{>k}}{N_{>k} (N_{>k} - 1)} , \]

where $N_{>k}$ is the total number of nodes with degree greater than k.
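
A direct transcription of this definition into Python might look as follows. The function name `rich_club` and the example graph are assumptions of this sketch; edge direction, duplicate edges and self-loops are discarded, and the coefficient is returned as `None` when fewer than two nodes exceed the degree threshold, since the normalization is then undefined.

```python
def rich_club(edges, k):
    """r(k) = 2 e_{>k} / (N_{>k} (N_{>k} - 1)) for an undirected edge list."""
    # Collapse direction and duplicates; drop self-loops.
    undirected = {frozenset(e) for e in edges if e[0] != e[1]}
    degree = {}
    for edge in undirected:
        for node in edge:
            degree[node] = degree.get(node, 0) + 1
    rich = {node for node, d in degree.items() if d > k}
    if len(rich) < 2:
        return None  # undefined: no pair of rich nodes exists
    # Count edges whose two endpoints both belong to the rich set.
    e_rich = sum(1 for edge in undirected if edge <= rich)
    return 2 * e_rich / (len(rich) * (len(rich) - 1))


# Triangle (1, 2, 3) with a pendant node 4 attached to node 3.
example = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(rich_club(example, 0), rich_club(example, 1), rich_club(example, 2))
```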

The k-core decomposition performs a successive pruning on the least connected vertices of a
network. At each step one removes all nodes with a degree less than k along with their edges and
continues in this manner until all remaining nodes have at least degree k. The remaining nodes
constitute the k-core. Next, k is incremented by one, and the process is repeated until no nodes
are left. The k-shell is defined as the set of nodes that belong to the k-core, but not the
(k + 1)-core.
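
The pruning procedure can be sketched as below. This is a straightforward, unoptimized transcription of the description above and not the web-based tool used for the figures; the function name `k_shells` and the example graph are assumptions of this sketch. Each node is assigned the index of its shell, so the largest value in the result is the number of the innermost non-empty core.

```python
def k_shells(edges):
    """Return a dict node -> shell index, by successive pruning."""
    adj = {}
    for a, b in edges:
        if a != b:  # self-loops are ignored
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    shell = {}
    k = 1
    while adj:
        # Prune until every remaining node has degree >= k (the k-core).
        while True:
            low = [node for node, nbrs in adj.items() if len(nbrs) < k]
            if not low:
                break
            for node in low:
                for neighbor in adj.pop(node):
                    if neighbor in adj:
                        adj[neighbor].discard(node)
                # Removed while forming the k-core: member of the (k-1)-shell.
                shell[node] = k - 1
        k += 1
    return shell


# Triangle (1, 2, 3) with a pendant node 4 attached to node 3.
shells = k_shells([(1, 2), (2, 3), (1, 3), (3, 4)])
print(shells, max(shells.values()))
```

Here the pendant node falls in the 1-shell, while the triangle survives as the 2-core and forms the 2-shell.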

Once the shape of the TF length distribution, the width of the PR region, as well as the
functional form of its distribution have been fixed through the available biological data, the
only remaining adjustable parameter in our model is the exponent $\mu$ of the power law
distribution of PR lengths, $p(l) \propto l^{-1-\mu}$. The k-core decomposition turns out to
provide the most detailed and stringent topological characterization of the network, with both
the total number of shells, and the distribution of the nodes over the shells, being contained in
the k-core plots (see Fig. 5). The k-core plots also incorporate such qualitative features as
inter- and intra-shell connectivity. We have therefore used qualitative and quantitative
comparison of the k-core plots for the Yeastract and the model network to determine $\mu$. The
best agreement was obtained for $\mu = 0.1$. Once $\mu$ has been fixed, no further adjustment is
needed in order to obtain the extremely close matching that is found between the degree
distributions, clustering coefficients, degree correlations and the rich-club coefficient, as
displayed in Figs. 3 and 4.



We cannot rule out the possibility of obtaining similar agreement between our model and the real
genomic network with respect to the features considered here, for a different choice of the
functional form of the length distribution for the PR sequences, once more determining an
adjustable parameter from a comparison of the k-core plots. However, the present choice seems to
be the only reasonable one within the physical constraints and the available information.



VI. ACKNOWLEDGMENTS 

We would like to thank Betül Kırdar and Beste Kınıkoğlu for the use of their data and useful
discussions. It is a pleasure to thank Alessandro Vespignani and Ignacio Alvarez-Hamelin for
bringing k-core analysis to our attention, and for the use of their web-based k-core analysis
tool. AE would like to thank Tamas Vicsek and Andras Czirok for a useful discussion and is
grateful for partial support from the Turkish Academy of Sciences.



[1] Lockhart, D.J., Winzeler, E.A. (2000) Nature 405, 827-836.

[2] Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O.,
Botstein, D., Futcher, B. (1998) Molecular Biology of the Cell 9, 3273-3297.

[3] Alberts, B., Johnson, A., Lewis, J., Raff, M., Roberts, K., Walter, P. (2002) in Molecular
Biology of the Cell, Chapter 9 (Garland Science, N.Y.).

[4] Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett,
N.M., Harbison, C.T., Thompson, C.M., Simon, I. et al. (2002) Science 298, 799-804.

[5] Harbison, C.T., Gordon, D.B., Lee, T.I., Rinaldi, N.J., Macisaac, K.D., Danford, T.W.,
Hannett, N.M., Tagne, J.B., Reynolds, D.B., Yoo, J., et al. (2004) Nature 431, 99-104.

[6] Bergmann, S., Ihmels, J., Barkai, N. (2004) PLoS Biol. 2, 85-93.

[7] Balcan, D., Erzan, A. (2004) Eur. Phys. J. B 38, 253.

[8] Mungan, M., Kabakcioglu, A., Balcan, D., Erzan, A. (2005) J. Phys. A 38 (44), 9599-9620.

[9] Dorogovtsev, S.N., Mendes, J.F.F. (2002) Adv. Phys. 51, 1079-1187.

[10] Watts, D.J. & Strogatz, S.H. (1998) Nature (London) 393, 440-442.

[11] Colizza, V., Flammini, A., Maritan, A., Vespignani, A. (2005) Physica A 352, 1-27.

[12] Zhou, S. & Mondragon, R.J. (2004) IEEE Commun. Lett. 8, 180-182.

[13] Colizza, V., Flammini, A., Serrano, M.A. & Vespignani, A. (2006) Nature Physics 2, 110-115.

[14] Bollobas, B. (1998) Modern Graph Theory (Springer Verlag, New York).

[15] Tong, A.H.Y., Drees, B., Nardelli, G., Bader, G.D., Brannetti, B., Castagnoli, L.,
Evangelista, M., Ferracuti, S., Nelson, B., Paoluzi, S. et al. (2002) Science 295, 321-324.

[16] Bader, G.D. & Hogue, C.W.V. (2002) Nature Biotechnology 20, 991-997.

[17] Bader, G.D. & Hogue, C.W.V. (2003) BMC Bioinformatics 4(2).

[18] Altaf-Ul-Amin, M., Nishikata, K., Koma, T., Miyasato, T., Shinbo, Y., Arifuzzaman, M.,
Wada, C., Maeda, M., Oshima, T., Mori, H. et al. (2003) Genome Informatics 14, 498-499.

[19] Wuchty, S. & Almaas, E. (2005) Proteomics 5(2), 444-449.

[20] Shannon, C.E. (1949) Proc. IRE 37, 10-21.

[21] Almirantis, Y. and Provata, A. (1999) J. Stat. Phys. 97, 233-262.

[22] Erdos, P. & Renyi, A. (1960) Publ. Math. Inst. Hung. Acad. Sci. 5, 17-60.

[23] Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N., Barabasi, A.-L. (2000) Nature 407, 651-654;
Albert, R., Jeong, H., Barabasi, A.-L. (1999) Nature 401, 130-131.

[24] Teixeira, M.C., Monteiro, P., Jain, P., Tenreiro, S., Fernandes, A.R., Mira, N.P., Alenquer,
M., Freitas, A.T., Oliveira, A.L., Correia, I. (2006) Nucl. Acids Res. 34, D446-451.

[25] Alvarez-Hamelin, I., Dall'Asta, L., Barrat, A., Vespignani, A. Arxiv preprint cs.NI/0504107.

[26] Wagner, A. (2001) Mol. Biol. Evol. 18, 1283.

[27] Şengün, Y., Erzan, A. (2006) Physica A 365, 446-462.

[28] Albert, R. and Barabasi, A.-L. (2002) Rev. Mod. Phys. 74, 47-97.






Supplementary Material 1 

Comparison with yeast data from different data bases 





FIG. 6: The network statistics extracted from the sources listed in Table I superposed on the simulation results corresponding to
100 realizations of the model network (black dots). The agreement is extremely good with all of these sets of data, which almost 
completely cover, but do not exceed the phase space of our model. (Black, red, blue, green yellow and maroon correspond to 
the model, Yeastract, Fraenkel Lab, Kirdar and Luscombe data respectively). 

Supplementary Material 2

Comparison with Randomized Networks 

To double check the significance of our other results, we also compared the clustering coefficients, the degree-degree
correlations and the rich-club coefficients of the Yeastract data with those obtained after randomly reconnecting
the edges of the network while keeping the degree of each node fixed. In this process, the directionality of the bonds
is ignored. The comparison of the topological coefficients of the randomized yeast and randomized model networks
with those of the yeast network, as shown in Fig. 7, confirms that the observed agreement between the yeast and
model networks is not spurious.
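
One common way to reconnect edges at random while keeping every degree fixed is the pairwise edge swap, sketched below. The text does not say which randomization algorithm was used, so this is only one plausible realization; the function names, the cap on the number of attempts, and the example graph are assumptions of this sketch. The input is taken to be an undirected simple graph, with self-loops and duplicate edges removed beforehand.

```python
import random


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg


def degree_preserving_shuffle(edges, n_swaps, rng):
    """Rewire (a, b), (c, d) -> (a, d), (c, b), rejecting loops and duplicates."""
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:  # arbitrary safety cap
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:  # pick one of the two possible pairings
            c, d = d, c
        if a == d or c == b:
            continue  # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


# Example: an 8-node ring with two chords.
original = [(i, (i + 1) % 8) for i in range(8)] + [(0, 4), (2, 6)]
shuffled = degree_preserving_shuffle(original, 20, random.Random(1))
print(degrees(original) == degrees(shuffled))
```

Each accepted swap removes one edge end from and adds one edge end to each of the four nodes involved, so every node keeps its degree by construction.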

FIG. 7: a) The clustering coefficient, b) the degree-degree correlations between neighboring nodes, and c) the rich-club coefficient
of Yeastract data (red circles) compared with the results for the same obtained by randomizing the Yeastract data (red dots) 
and randomizing a realization of the model network (black dots), keeping the degrees of the individual nodes, and thereby the 
degree distributions, fixed. 

In Fig. 8 we display the effect of performing the same randomization procedure as described above on the k-core
plots. It is instructive to note that while in the yeast and model networks a large fraction of connections is
between nearby shells, the situation is reversed in the randomized networks, where there is a high degree of intra-shell
connectivity, as can be seen from Fig. 8.




FIG. 8: The k-core analysis of the randomized versions of the model (left panel) and Yeastract (right panel) networks yields
results that differ quantitatively and qualitatively from the originals. The number of shells has gone up to 29 from 9, and the
much higher intra-shell rather than inter-shell connectivity (as can be seen by following the edges) indicates that the hierarchical
nature of the yeast network, which is faithfully reproduced by the model, is destroyed by the randomization process.

Supplementary Material 3

The k-core structure of the Balcan-Erzan and Barabasi-Albert Networks

In Fig. 9 we show the k-core structure of the Balcan-Erzan [7] and Barabasi-Albert networks, as models for
complex networks. Note the absence of well-defined hierarchical structures.




FIG. 9: The k-core analysis of the content-based network of Balcan and Erzan [7] (left panel) and the Barabasi-Albert (BA)
model [28]. In the left panel, the total length of the single sequences associated with all of the nodes is L = 15000. The
individual sequences obey the length distribution $p(l) \propto q^l$, with q = 0.95. The BA model network (right panel) has 5000
nodes, and is built by starting from a fully connected four-cluster and adding nodes with two edges at a time. In the k-core
plot for the latter, only 5% of the edges are shown for better visibility.

Supplementary Material 4

Ranking of overlapping sets of regulated genes and motif inclusion 

We here report a statistical fact in support of the basic assumption underlying our model. The matching condition
we employ dictates a certain correlation between the sets of genes regulated by each TF: if the binding motif of a
TF (A) is embedded in that of a TF (B), then the set of genes $\{G_i\}_B$ regulated by TF$_B$ in our model is a subset of
$\{G_i\}_A$. A similar investigation of the yeast databases listed below reveals that the top 50% of the TF pairs related
by the motif inclusion relation above rank in the top 3% when all the TF pairs are listed according to the overlap of
their $\{G_i\}$ sets. The actual ranking of the TF pairs obtained among all possible pairs of 102 TFs with known binding
motifs is shown in Fig. 10.
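
The subset property follows from the matching rule alone: any promoter that contains the longer motif necessarily contains every substring of it. A small check in Python, with made-up bit strings and a hypothetical helper `targets`, illustrates this for the model (it says nothing about the yeast data themselves).

```python
def targets(motif, promoters):
    """Set of genes whose promoter contains the motif as a substring."""
    return {gene for gene, pr in promoters.items() if motif in pr}


# Hypothetical promoters and motifs; motif_a is embedded in motif_b.
promoters = {"g1": "0010110", "g2": "1101011", "g3": "0000111", "g4": "1010000"}
motif_a = "101"
motif_b = "1011"

regulated_by_a = targets(motif_a, promoters)
regulated_by_b = targets(motif_b, promoters)
print(sorted(regulated_by_a), sorted(regulated_by_b))
print(regulated_by_b <= regulated_by_a)  # subset test
```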

FIG. 10: Correlation between the sets of proteins regulated by the TFs with similar binding motifs. The vertical axis is the
percentage overlap of the two sets of genes regulated by an arbitrary pair of TFs, which are ranked on the horizontal axis 
according to their overlap. The red vertical lines mark those pairs of TFs that are also related by binding motif inclusion. The 
accumulation of the red lines to the left of the graph is indicative of the correlation described in the text. 



On the other hand, the more straightforward expectation that TFs with short binding motifs should regulate more
genes is not verified by the same data. This curious fact probably points to certain sequence correlations arising from
the duplication and divergence processes [26] that distort the occurrence statistics of the binding motifs in PRs. Note
that the result in Fig. 10 is robust to such deviations from the unbiased probabilities for the occurrence of different
strings.



